Introduction

How does music relate to lyrics? It is tempting to think that a song tries to convey some feeling or emotion, and that both the music and lyrics are there to support this message. Let me give you an example. We might expect a song with a slow beat and laid back guitar to talk about laid back topics, maybe a trip to the beach. At the other end of the spectrum, heavy metal would likely concern itself with darker, heavier subjects. However, are these suspicions even true? Let’s put some numbers to the hypothesis that there in fact is a relationship between music and lyrics. In the next sections I’ll take you through a journey where we approach this topic with a statistical mindset, harnessing all the powers that modern technology has to offer along the way.

We’ll start by picking a large body of music and collecting the lyrics for each track in it. Scraping all the lyrics from the internet manually would be far too cumbersome, but fortunately the Musixmatch API makes it possible to fetch lyrics from code in a single API call. With an unpaid account only 30% of the lyrics of a queried track is returned, but that will do for our purposes. The lyrics themselves are not enough, however: we need to capture their meaning in a numerical value in order to run statistical tests on them. Therefore I assign every set of lyrics a sentiment score, also known as valence (or valency), computed automatically by the NLTK package, which offers natural language processing functionality. A low score indicates a sad sentiment, whereas a high score indicates a happy one. To be precise, the score ranges from -1.0 to 1.0. This lets us analyze the mood of the lyrics, which is arguably one of the most important aspects of music. The more topic-oriented side of lyrics analysis, concerning the actual intellectual ‘meaning’ of words, we’ll leave aside in this research. Equipped with a strategy for quantifying lyrics, we can start to tackle the problem of how to approach computational music analysis.
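To make the scoring step concrete, here is a minimal sketch. It assumes the score is the compound value of NLTK's VADER analyzer (the text only names NLTK, so the choice of VADER is my assumption), and the helper names make_vader_analyzer and lyrical_valence are hypothetical.

```python
from typing import Protocol


class Analyzer(Protocol):
    """Anything that exposes a VADER-style polarity_scores method."""

    def polarity_scores(self, text: str) -> dict: ...


def make_vader_analyzer() -> Analyzer:
    """Build NLTK's VADER analyzer (assumed scorer; needs the vader_lexicon resource)."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    return SentimentIntensityAnalyzer()


def lyrical_valence(lyrics: str, analyzer: Analyzer) -> float:
    """Return the compound sentiment score in [-1.0, 1.0] for one set of lyrics."""
    if not lyrics.strip():
        return 0.0  # assumption: treat missing lyrics as neutral
    return analyzer.polarity_scores(lyrics)["compound"]
```

Calling lyrical_valence(text, make_vader_analyzer()) on a gloomy line should give a negative number and on a cheerful line a positive one. Passing the analyzer in as an argument keeps the lexicon download out of the scoring function, so the analyzer is built once and reused for the whole corpus.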

Instead of zeroing in on a particular aspect of music or a niche genre or artist, we are going to keep the research broad, so that any conclusions we draw apply to music as a whole. Throughout this storyboard we’ll divvy up music into four main elements: melody, harmony, instrumentation and rhythm. For each of these we’ll attempt to confirm or deny hypotheses that are grounded in intuition, but not (yet) in science. Of course we still require a way to quantify music, but luckily Spotify is kind enough to provide a goldmine of musical data that we’ll use.

Let us dive into it!





Valencies for two hypothetical sets of lyrics:

"What a sad, miserable day. It's raining and I have debts to pay!": -0.79

"It's the most jolly moment in a while, to treat my granny to a smile!": 0.84

Corpus

The first order of business is choosing the corpus of music. We have chosen a broad research question and the corpus should reflect that: it should draw on a large number and variety of songs, because only then can we justify general conclusions. The reason I won’t simply pick the top 50 songs for each decade, or any top list for that matter, is that such a list would be heavily pop-dominated. There is much more to music than just popular music. What matters here is variety rather than popularity, so we shall cover a range of different genres. To be clear, this does not exclude the most popular songs. All songs in the corpus are sourced from albums that are exemplary of the niche they occupy and popular within their genre, only not necessarily the most popular of their time. I take an album-oriented approach for two reasons. Including all songs from a number of artists would imbalance the data set, because some artists have produced hundreds of songs and others only a dozen. Including a top list for each genre would bring back the problem discussed before: it stresses popularity instead of variety.

To summarize, we want a large corpus from a variety of genres, which, because the heavy lifting in terms of fetching the data is done by the Musixmatch and Spotify APIs, is most certainly possible. The final list of albums that are included in this research:

This totals 284 tracks and 16 hours of listening time.





* You might’ve noticed a lot of albums have been selected for this genre, but the alternative genre is extremely broad and the albums are highly diverse.

† By Miscellaneous I simply denote albums that cross the barriers of genre as they contain numerous elements of a plethora of genres.


Disclaimer: Please note that these genre classifications are not objective, but subjective and approximate.


Playlist

Discovery


Research: What are the general, high level patterns in the data?

Before we look at the individual musical elements as discussed earlier, we should try to explore the data we’re dealing with. The Spotify API offers a plethora of features that range from very high to very low level. Here we will use some of the high level ones, like musical valence and loudness, to learn about the corpus. In particular, it makes most sense to compare the high level musical valence feature with our computed lyrical valence value and see if there is any connection.

When we do this we find something very interesting. There appears to be a relationship between musical and lyrical valence, but not in the way one would expect. Tracks with low musical valence and tracks with high musical valence both tend to have high lyrical valence, whereas tracks with musical valence in the middle of the pack tend to have lower lyrical valence. The size of the corpus lends this observation some weight. Upon closer inspection a limitation of lyrical valence becomes apparent as well: the NLTK sentiment scorer struggles with lyrics that require a deeper understanding of (cultural) context and nuance. For example, bury a friend by Billie Eilish has a lyrical valence of 0.94, even though its lyrics are clearly dark. This is only an exception though; for the large majority of songs the lyrical valence makes a lot of sense, for example a high score (0.99) for Feel Good Inc. by Gorillaz and a low score (-0.94) for Anti-Hero by Taylor Swift. This gives credence to the claim we made.
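A U-shaped pattern like this is easy to miss with a single linear correlation coefficient, because the two arms of the U cancel each other out. One simple way to check for it is to average the lyrical valence within bins of musical valence. The sketch below does that; binned_means is a hypothetical helper and the track values are made up purely for illustration.

```python
from statistics import mean


def binned_means(pairs, n_bins=3):
    """Average lyrical valence within equal-width bins of musical valence (0 to 1)."""
    bins = [[] for _ in range(n_bins)]
    for musical, lyrical in pairs:
        # Clamp the index so that a musical valence of exactly 1.0 lands in the last bin.
        index = min(int(musical * n_bins), n_bins - 1)
        bins[index].append(lyrical)
    return [mean(b) if b else None for b in bins]


# Made-up (musical valence, lyrical valence) pairs, for illustration only.
tracks = [(0.1, 0.8), (0.2, 0.6), (0.5, -0.2), (0.6, 0.0), (0.9, 0.7), (0.8, 0.9)]
low, middle, high = binned_means(tracks)
print(low, middle, high)
```

If the middle bin sits clearly below both outer bins, as it does for this toy data, the relationship is U-shaped rather than linear.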

Let’s try to dive even deeper and analyze a much more low level feature of music: melody.

Melody


Research: can we identify melodies from audio data?

Intuitively, it seems that melody encodes a lot of the valency information of a song. The melody is usually the most memorable part and often indicative of the feel of a song. So it makes sense to look at the melody of two tracks, one with low and one with high lyrical valence, and investigate how melody correlates with lyrical valence. A sensible visualization tool to use is a chromagram. For each moment in time it captures which notes are being played, as estimated with the Fourier transform. Let’s try this and see if any melody lines become apparent.
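To give an idea of what goes into one column of such a plot, here is a rough sketch that folds the Fourier spectrum of a single audio frame into the 12 pitch classes. It uses NumPy, the function name chroma_frame is hypothetical, and the details (window, frequency range) are my own assumptions rather than the pipeline behind the plots in this storyboard.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_frame(samples, sample_rate):
    """Fold the power spectrum of one audio frame into 12 pitch-class bins."""
    window = np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(samples * window))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    chroma = np.zeros(12)
    audible = (freqs >= 27.5) & (freqs <= 4186.0)  # roughly the range of a piano
    # MIDI note number: 69 is A4 = 440 Hz, with 12 semitones per octave.
    midi = np.round(69 + 12 * np.log2(freqs[audible] / 440.0)).astype(int)
    np.add.at(chroma, midi % 12, spectrum[audible] ** 2)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma


# One second of a pure 440 Hz tone should put nearly all energy in the "A" bin.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
frame = chroma_frame(np.sin(2 * np.pi * 440.0 * t), sample_rate)
strongest = PITCH_CLASSES[int(np.argmax(frame))]
```

A full chromagram repeats this for consecutive frames and stacks the resulting columns over time. Note that octave information is thrown away on purpose, which is one reason a melody line is harder to spot than you might hope.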

Unfortunately, looking at the chromagrams, no discernible melody is recognizable. The only thing that sticks out is the droning ‘E’ in Ball and Biscuit, but this could hardly be called a melody. If you listen to the tracks there are clear, repeating melodies present (at least to the human ear), which at first thought should give rise to repeating patterns in the chromagram. That turns out not to be the case. It appears we need a different tool.

Melody/Harmony


Research: what is the saddest key?

Apparently it’s difficult to find melodies in a chromagram. Instead of identifying specific melody lines, we could focus on the key in which the melody is played. Luckily Spotify gives us the key and mode of every track in our corpus, so we don’t have to compute this ourselves. When we plot the lyrical valency for each key, where the band around the bars represents the number of tracks in that key, we get the bar plot to the side. What immediately meets the eye is a huge spike at the D sharp (or E flat) key. What could this mean? Unfortunately not a lot, because upon closer inspection songs in that key turn out to be heavily underrepresented in the corpus. There does seem to be quite a bit of variation among keys, especially for the keys on B, which appear to go together with more negative lyrics. This suggests that there may in fact be such a thing as “the saddest key” (which would be B major).

However, this could be coincidental, and the effect may cancel out if the corpus were much larger. Something that favors this more cautious reading is that the average lyrical valencies converge to ~0.13. There is no notable difference between the averages for the major and minor modes, even though we’re always told that minor keys are “sad” and major keys are “happy”. These results contradict what our music teachers have been telling us for centuries (at least, lyrically speaking)!
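The per-key averages behind a plot like this can be sketched in a few lines. The code below assumes each track comes with Spotify's integer key (0 is C, 1 is C sharp, and so on, with -1 meaning no key was detected) and mode (1 for major, 0 for minor). The helper valence_by_key is hypothetical and the rows are made up.

```python
from collections import defaultdict
from statistics import mean

KEY_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def valence_by_key(tracks):
    """Map 'key mode' labels to (mean lyrical valence, number of tracks)."""
    groups = defaultdict(list)
    for key, mode, valence in tracks:
        if key < 0:
            continue  # skip tracks for which no key was detected
        label = f"{KEY_NAMES[key]} {'major' if mode == 1 else 'minor'}"
        groups[label].append(valence)
    return {label: (mean(vals), len(vals)) for label, vals in groups.items()}


# Made-up (key, mode, lyrical valence) rows, for illustration only.
rows = [(11, 1, -0.6), (11, 1, -0.2), (3, 1, 0.9), (0, 0, 0.1), (-1, 1, 0.5)]
summary = valence_by_key(rows)
```

Returning the track count next to each mean is the important design choice here: an extreme average that rests on one or two tracks, like the D sharp entry in this toy data, is immediately recognizable as unreliable.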

Harmony detection limitations


I should address some issues with automatic key matching that explain why we should take the idea that B major is the saddest key with an even larger grain of salt. Key matching might work for a lot of tracks, but there are many cases where it fails too. It fails most spectacularly for highly percussive tracks, which are common in hip hop. This is due to the inharmonic nature of most percussive instruments. Take for instance UNTITLED by JPEGMAFIA. Upon listening, the energetic hi-hats and fast kick drums stand out. This shows in the corresponding keygram, where each section of the track is matched against every key on the y-axis. The brighter the tile, the more strongly the key matched. For a track with a clear key we’d expect a straight bright line that changes height after a modulation. But in UNTITLED there is no such pattern to be found.
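For the curious, one column of a keygram can be approximated by correlating a 12-bin chroma vector with a template for each of the 24 keys. The sketch below uses the Krumhansl-Kessler key profiles as templates; that choice is an assumption on my part, since other templates exist and I am not claiming these are the ones behind the plots here. The function names are hypothetical.

```python
from math import sqrt

KEY_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Krumhansl-Kessler profiles, listed from the tonic upwards in semitones.
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def key_scores(chroma):
    """Correlate a 12-bin chroma list (index 0 = C) with all 24 key templates."""
    scores = {}
    for tonic in range(12):
        # Rotate the chroma so that the candidate tonic sits at index 0.
        rotated = chroma[tonic:] + chroma[:tonic]
        scores[f"{KEY_NAMES[tonic]} major"] = pearson(rotated, MAJOR_PROFILE)
        scores[f"{KEY_NAMES[tonic]} minor"] = pearson(rotated, MINOR_PROFILE)
    return scores


# An idealized D major chroma: the major profile shifted up by two semitones.
d_major_chroma = MAJOR_PROFILE[-2:] + MAJOR_PROFILE[:-2]
scores = key_scores(d_major_chroma)
best = max(scores, key=scores.get)
```

For the idealized input the best match is D major. For a perfectly flat chroma vector, which is the extreme case of noisy, percussive audio, every key scores the same, and that is exactly the kind of washed-out keygram described above.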

Another issue stems from a limitation of the Spotify API: it distinguishes only two modes, major and minor. This is problematic, because many artists use other modes to achieve effects that cannot be achieved with just minor or major keys. The song Electioneering by Radiohead is in D Dorian, which is D minor with a raised sixth. The result is that the key lies somewhere in between D minor and D major, which is reflected in the keygram.

Though we could match the keys ourselves for every track in the corpus, another issue would present itself. If we tried to match every possible mode, the search space would become too big and our results too cluttered to identify a specific key, as multiple keys would always match somewhat.

With that researched, we shall continue to investigate instrumentation.

Instrumentation


Hypothesis: different timbres correlate with different lyrical valences.

I should slightly rephrase the question: instead of the relationship between instrumentation and lyrical valence, we look at the relationship between timbre and lyrical valence. Which instruments are playing in a song might be relatively easy for a human to figure out, but for a computer this is an almost insurmountable task. The closest thing we actually can measure is timbre, also known as the ‘color’ of the sound. This definition might seem vague, and it is: timbre means anything but pitch, duration or loudness (though sometimes loudness is included).

In order for us to measure how timbre relates to lyrical valence we will compare two tracks on opposite sides of the valence spectrum. In particular, I chose Everybody’s Got Something To Hide Except Me And My Monkey and I’m So Tired written by John Lennon, because they allow us to see how the same artist writes lyrics in relation to timbre. For both these songs, the timbre-based self-similarity matrix enables us to compare timbre between sections and see how lyrical valence, which you can read next to the lines, changes when timbre changes.
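In case the term is new: a self-similarity matrix simply compares every section of a track with every other section. A minimal sketch, assuming one timbre vector per section and cosine distance as the measure (both the distance and the function names are my own assumptions):

```python
from math import sqrt


def cosine_distance(a, b):
    """1 minus the cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def self_similarity(timbre_vectors):
    """Pairwise distance matrix: entry [i][j] compares section i with section j."""
    return [[cosine_distance(a, b) for b in timbre_vectors] for a in timbre_vectors]


# Made-up 3-dimensional timbre vectors for four sections, for illustration only.
# (The timbre vectors in Spotify's audio analysis have 12 dimensions.)
sections = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1], [1.0, 0.2, 0.0], [0.0, 0.1, 1.0]]
matrix = self_similarity(sections)
```

In this toy example sections 0 and 2 are identical, so their distance is (numerically) zero, while section 3 stands apart from all others, much like an outro with a different sound would show up as a distinct block in the plot.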

However, after investigating the matrices there do not seem to be any significant differences between sections, except for the outros. But those turn out to be red herrings. Upon inspection of the lyrics of the first track, the outro is just a repetition of the valence-less words ‘come on’. As for the second track, the outro is simply a repetition of two lines that were used in previous sections. Because these two lines, which are lyrically somewhat high in valence, were isolated and repeated in the last segment, the lyrical sentiment became positive. In light of this evidence there appears to be no connection between timbre and lyrical valence. Interesting patterns might emerge if more tracks were compared, but as it stands none have been observed.

The next thing to try is rhythm.

Rhythm


Hypothesis: higher tempo songs tend to be more aggressive and slower songs more sensual.

Let’s put this one to the test. For this hypothesis we’ll denote songs that have a lower BPM than the median (< 115.9 BPM) as slow songs, and the remainder as fast songs (≥ 115.9 BPM).

So far we’ve explored only the lyrical valency property, but not the lyrics themselves. We might gain some new insights if we look at the lyrics directly, so let’s try it. One of the most useful tools for visualizing patterns in textual data is a so-called word cloud, which you can see to the side. The words in blue refer to words that occur very frequently in fast songs relative to slow songs, and vice versa for the red words.
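The numbers behind such a word cloud can be sketched as follows: split the tracks at the median tempo and compare how often each word occurs in the two groups. The exact weighting used for the plot is not something I spell out here, so treat the simple difference in relative frequency below as an assumption. All function names are hypothetical and the lyrics are made up.

```python
from collections import Counter
from statistics import median


def split_by_tempo(tracks):
    """Split (tempo, lyrics) pairs into slow (< median BPM) and fast (>= median BPM)."""
    cutoff = median(tempo for tempo, _ in tracks)
    slow = [lyrics for tempo, lyrics in tracks if tempo < cutoff]
    fast = [lyrics for tempo, lyrics in tracks if tempo >= cutoff]
    return slow, fast


def relative_frequencies(lyrics_list):
    """Share of all words in the group that each word accounts for."""
    counts = Counter(word for lyrics in lyrics_list for word in lyrics.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def frequency_gap(slow, fast):
    """Positive values are more typical of fast songs, negative ones of slow songs."""
    slow_freq, fast_freq = relative_frequencies(slow), relative_frequencies(fast)
    words = set(slow_freq) | set(fast_freq)
    return {w: fast_freq.get(w, 0.0) - slow_freq.get(w, 0.0) for w in words}


# Made-up (tempo in BPM, lyrics) pairs, for illustration only.
tracks = [(90.0, "kiss me slow"), (100.0, "kiss the boy"),
          (140.0, "run run go"), (160.0, "go kill the beat")]
slow, fast = split_by_tempo(tracks)
gap = frequency_gap(slow, fast)
```

Using relative rather than absolute counts matters, because fast songs tend to contain more words overall and would otherwise dominate the comparison.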

Immediately we can see instances that support the hypothesis. Slow song words include sensual words such as number (as in, someone’s phone number), kiss, boy and hot. These are words we would expect to encounter in a love song. What stands out, though, is that love appears among the fast songs. There are also some odd ones out, like bones. As for the fast tracks we also find what one would expect, e.g. aggressive words like kill, gun and ill. Also, very noticeably, we find numerous verbs and filler words. This makes sense: in a high-BPM track the singer (or rapper) has to keep up the pace, and it’s easiest for both listener and artist to reuse common verbs and filler words to keep the information stream somewhat limited.

Most of the data in this plot seems to confirm the hypothesis (though there are exceptions, like love among the fast tracks).

AI: hyperparameter tuning


Research: are there some hidden relationships that have yet to be found?

So far we have plotted numerous relationships between different variables and debunked or confirmed a number of hypotheses. I picked those visualizations because they seemed to hold potential for an interesting correlation, but there is still a chance that some totally unexpected patterns exist. Because these may escape a mere human like me, maybe the right machine learning tool can pick them up. So that’s what we’ll be trying.

The technique we’ll use is one of the most successful machine learning algorithms of the modern day, called extreme gradient boosting (XGBoost). In essence, it’s an ensemble of many small decision trees, each trained to correct the errors of the ones before it. To uncover hidden relationships we will train the model, using regression, to predict the lyrical valency of a song based on a whole array of inputs that the Spotify API delivers (like mode, tempo, musical valence, etc.). Before we dive in and use it, some preparations need to be taken care of. First, the corpus is split into a train set (to train the model) and a test set (to evaluate the model). Next, XGBoost requires us to set a number of hyperparameters. The idea is to tune these by training the model for many different combinations of hyperparameters and evaluating each combination using cross validation. In the plot you can see how well each parameter value works: the lower the RMSE, the better. We automatically pick the best combination of hyperparameters so we can properly train and evaluate the final model.
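The tuning loop itself is independent of the model, so it can be sketched without any machine learning library. In the code below a deliberately trivial stand-in model (shrunken_mean, hypothetical) takes the place of the gradient boosted trees so that the example runs on its own; in practice the fit_predict function would wrap the real model, and the grid would hold its hyperparameters, such as learning rate and tree depth.

```python
from itertools import product
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def k_fold_indices(n, k):
    """Yield (train, validation) index lists; fold i holds every k-th row from i."""
    for fold in range(k):
        validation = list(range(fold, n, k))
        train = [i for i in range(n) if i % k != fold]
        yield train, validation


def cross_validated_rmse(fit_predict, params, X, y, k=5):
    """Average validation RMSE of one hyperparameter combination over k folds."""
    scores = []
    for train, validation in k_fold_indices(len(y), k):
        predictions = fit_predict(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in validation], **params,
        )
        scores.append(rmse(predictions, [y[i] for i in validation]))
    return sum(scores) / len(scores)


def grid_search(fit_predict, grid, X, y, k=5):
    """Try every combination in `grid` and return (best_params, best_rmse)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_validated_rmse(fit_predict, params, X, y, k)
        if best is None or score < best[1]:
            best = (params, score)
    return best


def shrunken_mean(X_train, y_train, X_val, shrinkage):
    """Stand-in model: predicts the training mean pulled towards 0 by `shrinkage`."""
    centre = sum(y_train) / len(y_train)
    return [(1.0 - shrinkage) * centre] * len(X_val)


# Made-up data, for illustration only; the stand-in model ignores the features.
X = [[0.0] for _ in range(10)]
y = [0.2, 0.4, 0.1, 0.3, 0.5, 0.2, 0.4, 0.1, 0.3, 0.5]
best_params, best_score = grid_search(shrunken_mean, {"shrinkage": [0.0, 0.5, 1.0]}, X, y)
```

Only the training portion of the corpus should go through this loop. The held-out test set stays untouched until the final model is evaluated, otherwise the reported error would be too optimistic.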

AI: results


After training the model, we end up with an RMSE (root mean square error) of 0.7459795. Roughly speaking, this is how far off the model typically is from the correct answer. As a reminder: the lyrical valency ranges from -1 to 1. It could well be that this still sounds quite abstract. As a comparison I also evaluated a model that always makes a random guess. That model scores an RMSE of 0.8341123, which is noticeably worse. The XGBoost model therefore appears to have found some pattern, which makes it worth looking at what it found. One valuable piece of information that we can extract from the model is the set of feature importances it learned. They tell us how important each input parameter is deemed for predicting the lyrical valency, which you can see in the plot.
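For reference, this is how the error measure and a random-guess baseline can be computed. How exactly the random guesses were drawn is not stated above, so the uniform draw from [-1, 1] is an assumption. I also include a second reference point that always predicts the mean, since it is usually harder to beat than random guessing. The function names and sample values are hypothetical.

```python
import random
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def random_baseline_rmse(actual, trials=1000, seed=0):
    """Average RMSE of guessing uniformly in [-1, 1], over many random trials."""
    rng = random.Random(seed)
    scores = [
        rmse([rng.uniform(-1.0, 1.0) for _ in actual], actual) for _ in range(trials)
    ]
    return sum(scores) / len(scores)


def mean_baseline_rmse(actual):
    """RMSE of always predicting the mean of the observed values."""
    centre = sum(actual) / len(actual)
    return rmse([centre] * len(actual), actual)


# Made-up lyrical valencies, for illustration only.
sample = [0.5, -0.5, 0.9, -0.9, 0.1]
random_score = random_baseline_rmse(sample)
mean_score = mean_baseline_rmse(sample)
```

Because RMSE squares the errors before averaging, a few large misses weigh more heavily than many small ones, which is worth keeping in mind when reading it as a typical error.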

Many of these findings we had already made ourselves. We already discovered that the mode of the key does not matter. We found that musical valence and tempo correlate with lyrical valency. More interesting is what we did not find. According to the model, energy is the most predictive factor of the lyrical valency, even more so than musical valency. This makes sense intuitively: high energy songs might be more likely to have more energetic lyrics. The model also judges loudness and danceability to be somewhat important features. A reason could be that the genre of a track determines the range in which those features, including energy, fall, and that the genre also determines what the lyrics are generally about.

Conclusion

After exploring the four elementary facets of music, i.e. melody, harmony, instrumentation and rhythm, we gained numerous insights on the question: how does music relate to the lyrics? We’ve seen some musical aspects that correlate with properties of lyrics and some that don’t. On a high level we’ve learned that musical valence is related to lyrical valence, though not in a straightforward way. But can we figure out what low level features are at the root of this? Though melody lines are hard to quantify, we’ve looked at all 24 minor and major keys and found that, surprisingly, the mode does not make a difference. Among keys there seems to be some variance in lyrical valence, but this might disappear with an even bigger dataset. Furthermore, we’ve found no reason to suspect that instrumentation, measured by timbre, has an effect on lyrical valence or vice versa. Tempo, on the other hand, is clearly connected to the lyrics. This could be explained by the fact that many subgenres stick to approximately the same tempo and deal with the same lyrical subjects, for example some parts of rap. Lastly, AI helped us to find patterns that we did not find ourselves, such as energy being highly predictive of lyrical valence. In summary, there are definitely relationships between lyrics and music, though not in all areas.

That being said, the last word hasn’t been spoken on this subject. For future research it would be extremely interesting to delve deeper into more niche questions, like how Latin music lyrically relates to music from the USA. Dissecting the interconnectedness of lyrics and music may uncover countless and much deeper understandings of cultures across the globe. Furthermore, research like this attempts to discover how our feelings (music) connect with our intellect and analytical skills required for verbal processing (lyrics). Continued research will likely yield valuable knowledge about the human brain and what makes humans human.